March 25, 2015
The first step to data analysis: look at the data.
How to look at millions of loci (/variables/dimensions)?
(keyword: "dimensionality reduction")
Overview: 1. examples from the literature 2. how to do it: overview 3. properties of PCA 4. how to do it: hands-on
New insights into the Tyrolean Iceman's origin and phenotype as inferred by whole-genome sequencing, Keller et al 2011
Genomic divergence in a ring species complex, Alcaide et al 2014
Your data:
| Â | SNP1 | SNP2 | SNP3 | SNP4 | SNP5 | SNP6 | SNP7 | SNP8 | SNP9 | SNP10 | SNP11 | SNP12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I1 | A/A | A/C | C/C | T/T | T/T | T/T | C/C | C/C | A/A | T/T | T/T | T/T |
| I2 | G/G | C/C | C/C | T/T | T/T | T/T | C/C | C/C | A/A | T/T | T/C | C/T |
| I3 | A/A | A/C | C/C | T/T | T/T | T/T | C/C | C/C | A/A | T/C | T/T | T/T |
| I4 | A/A | C/C | C/C | T/T | T/A | T/T | C/C | C/C | A/A | T/T | T/T | T/T |
| I5 | A/A | C/C | C/C | T/T | T/T | T/T | C/C | C/C | A/A | T/T | T/T | C/T |
Translate alleles to "number of reference alleles":
| Â | SNP1 | SNP2 | SNP3 | SNP4 | SNP5 | SNP6 | SNP7 | SNP8 | SNP9 | SNP10 | SNP11 | SNP12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I1 | 0 | 1 | 2 | 2 | 2 | 0 | 2 | 0 | 0 | 2 | 2 | 0 |
| I2 | 2 | 0 | 2 | 2 | 2 | 0 | 2 | 0 | 0 | 2 | 1 | 1 |
| I3 | 0 | 1 | 2 | 2 | 2 | 0 | 2 | 0 | 0 | 1 | 2 | 0 |
| I4 | 0 | 0 | 2 | 2 | 1 | 0 | 2 | 0 | 0 | 2 | 2 | 0 |
| I5 | 0 | 0 | 2 | 2 | 2 | 0 | 2 | 0 | 0 | 2 | 2 | 1 |
note: arbitrary choice of "reference" allele.
Default choice:
Trim tightly linked loci (e.g. with plink)
Subtract locus means (\(2 \times{}\) allele frequencies)
Estimate covariance between individuals
e.g. the covariance matrix, obtained by cov(x, use='pairwise.complete.obs') in R:
covmat <- cov(sweep(simdata$genotypes,1,rowMeans(simdata$genotypes),"-"))
| Â | I1 | I2 | I3 | I4 | I5 |
|---|---|---|---|---|---|
| I1 | 0.07152 | -0.1133 | 0.05636 | 0.004848 | -0.01939 |
| I2 | -0.1133 | 0.3306 | -0.1209 | -0.07394 | -0.02242 |
| I3 | 0.05636 | -0.1209 | 0.1245 | -0.02545 | -0.03455 |
| I4 | 0.004848 | -0.07394 | -0.02545 | 0.0897 | 0.004848 |
| I5 | -0.01939 | -0.02242 | -0.03455 | 0.004848 | 0.07152 |
note: you can apply this to any similarity matrix; we'll assume we're using the covariance matrix.
pca <- eigen(covmat)
pca$values/sum(pca$values)
## PC1 PC2 PC3 PC4 PC5 ## 6.392467e-01 2.129006e-01 1.040497e-01 4.380302e-02 -1.738786e-18
pca$vectors
## PC1 PC2 PC3 PC4 PC5 ## I1 0.32238797 -0.1901125 -0.1238318 0.80286295 0.4472136 ## I2 -0.85559045 -0.2272455 -0.1259946 0.02120874 0.4472136 ## I3 0.37286129 -0.6102290 0.0662623 -0.53310816 0.4472136 ## I4 0.15811541 0.5908315 -0.5965320 -0.26470232 0.4472136 ## I5 0.00222578 0.4367556 0.7800961 -0.02626121 0.4472136
The eigenvectors are the coordinates of the samples on the principal components: so just plot them against each other:
pairs( pca$vectors[,1:3] )
"The principal components are the axes that explain the most variance."
Technically, this means that the top \(k\) PCs give the best \(k\)-dimensional approximation to the simliarity matrix: with eigenvalues \(\lambda_i\), eigenvectors \(v_i\), and similiarity matrix \(\Sigma\):
$ {i=1}^k i v_i v_i^T $
More is true: if \(\tilde G\) is the matrix of genotypes, with row and column means subtracted off, then there are vectors of allele weights \(w_i\) so that
$ G {i=1}^k wi v_i^T .$
Another consequence: the PCs can be expressed as weighted sums of the genotypes. (so: admixed individuals show up intermediate to parentals)
you can choose to upweight low frequency alleles (e.g. divide by \(1/\sqrt{p(1-p)}\)); this may or may not make a difference.
large linkage blocks (e.g. inversions) can hijack the top PCs.
Solution: prune linked SNPs.
if the sampling scheme is unbalanced, the top PCs may choose to explain the variance within the most heavily sampled population, rather than variation between populations.
Solution: downweight by sample size:
weighted.covmat <- covmat * weights[col(covmat)] * weights[row(covmat)] eigen(weighted.covmat)
nind <- 1000 nloci <- 1e4 simdata <- sim_data(nind=nind,nloci=nloci) plot( simdata$locs, col=cols, pch=pchs, main="sampling scheme" )
norm.genotypes <- sweep( simdata$genotypes, 1, rowMeans(simdata$genotypes), "-" ) covmat <- cov(norm.genotypes) pca <- eigen(covmat) plot(pca$values/sum(pca$values),ylab="proportion", main="proportion of variance explained", xlab="PC#")
layout(t(1:2)) plot(simdata$locs, col=cols, pch=pchs, main="locations" ) plot(pca$vectors[,1:2], col=cols, pch=pchs, main="PC map", xlab="PC1", ylab="PC2" )
pairs(pca$vectors[,1:5], col=cols, pch=pchs )
The PCs here are Fourier modes; so the PC plots are like Lissajous figures (discussion in Novembre & Stephens).
Same data as in Genes mirror geography within Europe, but without subsetting:
popres/crossprod_all-covariance.tsv : covariance matrix for all chromosomespopres/crossprod_chr8-covariance.tsv : covariance matrix for just chromosome 8popres/indivinfo.tsv : sample informationsource(popres/indivinfo.R) : for country abbreviations and colorsYour tasks:
Everything is related to everything else, but near things are more related than distant things.
Waldo Tobler, 1970
Why?
Offspring tend to live near to their parents, and so neighbors tend to
From a population viewpoint, local drift differentiates populations, while migration makes nearby ones more similar.
From the individual viewpoint, lower population sizes means more recent, nearby shared ancestors; a tendency counteracted by migration.